Bring CUDA support to `Tracking.jl` #33

coezmaden · 2021-11-14T09:40:12Z

This brings (the long awaited) GPU functionality to Tracking.jl. At the moment only CUDA.jl.

Notes / Features:

Compatibility with JuliaGNSS/GNSSSignals.jl#19
Opt-in GPU signal processing
Monolithic CUDA.jl kernel combining carrier and code replication, downconversion, and correlation for arbitrary num_samples, num_ants, num_correlators. (arbitrary as long as the hardware allows)
Conditional unit testing (only if CUDA.functional())
Updated README.md with a GPU example
Updated README.md for correct usage after Speculative: Move PRN to TrackingState #31

This is a first fully functioning version. Optimizations regarding the performance of the CUDA kernel will follow. This version will be considered a baseline.

Update forked master due to hotfix

Incorporate the hotfix from the master repo

…com/ozmaden/Tracking.jl into feature/gpu-accelerated-correlation

Compat LoopVectorization

…JuliaGNSS-master

codecov-commenter · 2021-11-14T09:46:17Z

Codecov Report

Merging #33 (724dcf6) into master (5c32278) will decrease coverage by 5.81%.
The diff coverage is 84.00%.

@@            Coverage Diff             @@
##           master      #33      +/-   ##
==========================================
- Coverage   90.59%   84.77%   -5.82%     
==========================================
  Files          16       17       +1     
  Lines         404      473      +69     
==========================================
+ Hits          366      401      +35     
- Misses         38       72      +34

Impacted Files	Coverage Δ
src/Tracking.jl	`100.00% <ø> (ø)`
src/carrier_replica.jl	`100.00% <ø> (ø)`
src/downconvert_and_correlate.jl	`33.33% <ø> (ø)`
src/tracking_loop.jl	`97.50% <82.35%> (-2.50%)`	⬇️
src/tracking_state.jl	`95.34% <87.50%> (-1.80%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5c32278...724dcf6. Read the comment docs.

coezmaden · 2021-11-14T10:33:28Z

A comment on the last commit 90358a9. It checks inside the track function if the signal and codes are of the same type in the GPU processing case. Here are the two examples where an error is thrown:

julia> signal_cuarray
2500-element CuArray{ComplexF32, 1, CUDA.Mem.DeviceBuffer}:

julia> results_gpu = track(signal_cuarray, state_gpu, sampling_frequency)
ERROR: ArgumentError: Signal is not a StructArray, initialize the signal properly and try again.

julia> fake_struct = StructArray(signal_cpu)
2500-element StructArray(::Vector{Float32}, ::Vector{Float32}) with eltype ComplexF32:

julia> results_gpu = track(fake_struct, state_gpu, sampling_frequency)
ERROR: ArgumentError: Signal and GNSS codes are not of the same type. Please check if CPU or GPU is used.

It comes with a slight penatly on the performance in the GPU case, but I think it's manageable.

#BEFORE THE COMMIT
julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  81.162 μs …  26.223 ms  ┊ GC (min … max): 0.00% … 41.79%
 Time  (median):     84.806 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   98.184 μs ± 439.458 μs  ┊ GC (mean ± σ):  3.16% ±  0.71%

   █▄                                                           
  ▅██▅▄▄▄▄▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▂▂▂▂▂▂ ▂
  81.2 μs         Histogram: frequency by time          177 μs <

 Memory estimate: 29.80 KiB, allocs estimate: 418.

#AFTER THE COMMIT
julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   83.498 μs …  26.311 ms  ┊ GC (min … max): 0.00% … 40.27%
 Time  (median):      89.555 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   111.975 μs ± 519.990 μs  ┊ GC (mean ± σ):  3.66% ±  0.80%

  █▇▄▂▁▁▁▁                                                      ▂
  ███████████▇▇▇▆▄▆▆▅▆▆▇▅▄▄▅▄▄▅▄▃▃▅▄▃▃▃▁▄▁▄▃▃▄▁▁▃▃▁▃▄▁▃▅▄▄▅▄▃▄▄ █
  83.5 μs       Histogram: log(frequency) by time        460 μs <

 Memory estimate: 29.80 KiB, allocs estimate: 418.

No regression for the CPU:

#BEFORE THE COMMIT
julia> @benchmark track($signal_cpu, $state_cpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  8.174 μs …  2.401 ms  ┊ GC (min … max): 0.00% … 99.11%
 Time  (median):     8.553 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.361 μs ± 32.195 μs  ┊ GC (mean ± σ):  4.84% ±  1.40%

   ▂█▅                                                        
  ▃████▅▅▄▃▃▄▄▄▄▅▅▄▃▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▁▂▂▂▂▂▁▁▂▁▁▁▂▂▂▂ ▃
  8.17 μs        Histogram: frequency by time        13.1 μs <

 Memory estimate: 5.12 KiB, allocs estimate: 94.

#AFTER THE COMMIT
julia> @benchmark track($signal_cpu, $state_cpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 3 evaluations.
 Range (min … max):  7.917 μs …  2.161 ms  ┊ GC (min … max): 0.00% … 99.45%
 Time  (median):     8.399 μs              ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.155 μs ± 30.340 μs  ┊ GC (mean ± σ):  4.67% ±  1.40%

    ██▆▁                                                      
  ▂▇████▅▄▄▅▅▅▆▆▅▄▄▃▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▂▁▂▁▁▂▂▂▂▂▂▂▂▂▁▂▂▂▂ ▃
  7.92 μs        Histogram: frequency by time        13.6 μs <

 Memory estimate: 5.12 KiB, allocs estimate: 94.

zsoerenm

I have some minor changes. Otherwise looks good!

zsoerenm · 2021-11-16T14:30:22Z

src/carrier_replica.jl

+Floating point CPU StructArray carrier generation
+"""
+function gen_carrier_replica!(
+    carrier_replica::StructArray{Complex{T},1,NamedTuple{(:re, :im),Tuple{Array{T,1},Array{T,1}}},Int64},
+    carrier_frequency,
+    sampling_frequency,
+    start_phase,
+    carrier_amplitude_power::Val{N},
+    start_sample,
+    num_samples
+) where {
+    T <: AbstractFloat,
+    N
+}
+    sample_range = start_sample:num_samples + start_sample - 1
+    @views @. carrier_replica.re[sample_range] = 2pi * (sample_range) * carrier_frequency / sampling_frequency + start_phase
+    @. carrier_replica.im[sample_range] = sin(carrier_replica.re[sample_range])
+    @. carrier_replica.re[sample_range] = cos(carrier_replica.re[sample_range])
+    return carrier_replica
+end
+


I guess this is a left over from previous tests. I think this can be removed.

zsoerenm · 2021-11-16T14:30:48Z

src/correlator.jl

@@ -124,7 +124,7 @@ Get prompt correlator
 function get_prompt(correlator::AbstractCorrelator, correlator_sample_shifts)
    correlator.accumulators[get_prompt_index(correlator_sample_shifts)]
 end
-
+CUDA.dot


What's that?

zsoerenm · 2021-11-16T14:39:45Z

src/tracking_state.jl

+It is determined based on the signal, which is given in the track
+function.
+"""
+struct CarrierReplicaGPU{


The CarrierReplicaGPU is never used, isn't it? I guess we can get rid of this structure.

zsoerenm · 2021-11-16T14:39:55Z

src/tracking_state.jl

+It is determined based on the signal, which is given in the track
+function.
+"""
+struct DownconvertedSignalGPU{


The DownconvertedSignalGPU is never used, isn't it? I guess we can get rid of this structure.

zsoerenm · 2021-11-16T14:41:39Z

src/tracking_state.jl

-        DS <: DownconvertedSignalCPU, # Union{DownconvertedSignalCPU, CuArray{ComplexF32}}
-        CAR <: CarrierReplicaCPU, # Union{CarrierReplicaCPU, CuArray{ComplexF32}}
-        COR <: Vector{Int8}, # Union{Vector{Int8}, CuArray{Float32}}
+        DS <: Union{DownconvertedSignalCPU, DownconvertedSignalGPU},


I think you can leave DownconvertedSignalCPU even for the GPU case and initialize that as an empty array.

zsoerenm · 2021-11-16T14:41:59Z

src/tracking_state.jl

-        CAR <: CarrierReplicaCPU, # Union{CarrierReplicaCPU, CuArray{ComplexF32}}
-        COR <: Vector{Int8}, # Union{Vector{Int8}, CuArray{Float32}}
+        DS <: Union{DownconvertedSignalCPU, DownconvertedSignalGPU},
+        CAR <: Union{CarrierReplicaCPU, CarrierReplicaGPU},


I think you can leave CarrierReplicaCPU even for the GPU case and initialize that as an empty array.

src/tracking_state.jl

coezmaden · 2021-11-17T07:10:41Z

@zsoerenm Thanks for the feedback. ccb75f0 deals with the first two comments, the others are handeled by 2a24f7d. The latter even improves the performance:

Before 2a24f7d

julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):   82.073 μs …  22.217 ms  ┊ GC (min … max): 0.00% … 38.81%
 Time  (median):      86.151 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   103.747 μs ± 487.141 μs  ┊ GC (mean ± σ):  3.95% ±  0.84%

  ▆█▇▄▄▅▄▃▂▁                                                    ▂
  ██████████▇▆▆▇▇█▇▆▇▇▇▆▇▇▆▆▅▅▅▅▅▅▅▅▅▆▄▄▄▅▃▅▅▅▃▁▁▁▄▄▁▁▃▄▁▁▄▃▄▅▄ █
  82.1 μs       Histogram: log(frequency) by time        211 μs <

 Memory estimate: 29.80 KiB, allocs estimate: 418.

After 2a24f7d

julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  64.100 μs …  21.582 ms  ┊ GC (min … max): 0.00% … 38.88%
 Time  (median):     66.980 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   82.030 μs ± 475.433 μs  ┊ GC (mean ± σ):  4.97% ±  0.86%

   ██                                                           
  ▄██▇▃▃▃▄▄▃▃▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▂▂▁▂ ▂
  64.1 μs         Histogram: frequency by time          134 μs <

 Memory estimate: 29.33 KiB, allocs estimate: 406.

zsoerenm · 2021-11-17T07:24:04Z

src/downconvert_and_correlate.jl

+    if sample_idx <= num_samples && antenna_idx <= num_ants && corr_idx <= num_corrs
+        # generate carrier
+        carrier_im[sample_idx], carrier_re[sample_idx] = CUDA.sincos(2π * ((sample_idx - 1) * carrier_frequency / sampling_frequency + carrier_phase))


For what purpose do you save the sincos result in a vector? Is it for performance improvements in the future? If this is the case, let's dismiss that here to have a clean baseline.

zsoerenm · 2021-11-17T07:25:11Z

src/tracking_state.jl

@@ -51,6 +50,7 @@ function DownconvertedSignalCPU(num_ants::NumAnts{N}) where N
    )
 end

+@inline gen_blank_carrier_gpu(system::S, num_samples) where {CO <: CuMatrix,S <: AbstractGNSS{CO}} = StructArray{ComplexF32}((CuArray{Float32}(undef, num_samples), CuArray{Float32}(undef, num_samples)))


I think this can be dismissed, if you do not save the sincos result in a vector.

This isn't used anymore, is it? So I think you can rid of this, too.

zsoerenm · 2021-11-17T07:28:00Z

src/tracking_state.jl

-        CAR <: CarrierReplicaCPU, # Union{CarrierReplicaCPU, CuArray{ComplexF32}}
-        COR <: Vector{Int8}, # Union{Vector{Int8}, CuArray{Float32}}
+        DS <: Union{DownconvertedSignalCPU, Nothing},
+        CAR <: Union{CarrierReplicaCPU, StructVector{ComplexF32, NamedTuple{(:re, :im), Tuple{CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}, CuArray{Float32, 1, CUDA.Mem.DeviceBuffer}}}, Int64}},


Same as above.

coezmaden · 2021-11-17T09:20:24Z

@zsoerenm I have implemented your proposal. Now that I thought about it, I too decided it made more sense to just leave everything at nothing. No significant regression in the execution time.

Before eab881c:

julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  65.749 μs …  24.988 ms  ┊ GC (min … max): 0.00% … 46.31%
 Time  (median):     69.804 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   92.254 μs ± 509.974 μs  ┊ GC (mean ± σ):  4.92% ±  0.90%

  █▆▅▃▂▁▁▁▁                                                    ▁
  ██████████▇▆▆▄▄▆▆▅▅▄▄▅▄▆▄▅▄▃▃▅▃▃▄▅▅▅▃▄▄▅▄▅▆▅▃▅▁▅▅▅▄▄▃▃▁▁▅▄▄▄ █
  65.7 μs       Histogram: log(frequency) by time       381 μs <

 Memory estimate: 29.33 KiB, allocs estimate: 406.

After eab881c:

julia> @benchmark track($signal_gpu, $state_gpu, $sampling_frequency)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  65.857 μs …  25.658 ms  ┊ GC (min … max): 0.00% … 33.48%
 Time  (median):     69.429 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   87.554 μs ± 426.402 μs  ┊ GC (mean ± σ):  3.05% ±  0.63%

  █▆▄▃▁▁▁▁▁▁                                                   ▁
  ███████████▆▆▆▇▅▆▅▅▄▅▅▄▅▅▆▄▄▅▄▄▁▁▄▄▄▅▃▁▄▄▃▃▃▅▅▄▄▅▃▄▅▄▄▄▄▄▄▄▃ █
  65.9 μs       Histogram: log(frequency) by time       366 μs <

 Memory estimate: 26.23 KiB, allocs estimate: 388.

zsoerenm

Alright, looks good.
Just 2 minor code smells: One is duplicate code and the other is a function, that isn't used anymore.
BTW: I'm surprised that writing to a CuArray in the kernel does not generate an overhead.

zsoerenm · 2021-11-17T09:57:43Z

src/tracking_loop.jl

+function choose(replica::Nothing, signal::AbstractArray)
+    nothing
+end


Those 3 lines are exactly the same as 370-372. So I think we can get rid of this.

zsoerenm · 2021-11-17T09:58:38Z

src/tracking_state.jl

@@ -51,6 +50,7 @@ function DownconvertedSignalCPU(num_ants::NumAnts{N}) where N
    )
 end

+@inline gen_blank_carrier_gpu(system::S, num_samples) where {CO <: CuMatrix,S <: AbstractGNSS{CO}} = StructArray{ComplexF32}((CuArray{Float32}(undef, num_samples), CuArray{Float32}(undef, num_samples)))


This isn't used anymore, is it? So I think you can rid of this, too.

coezmaden · 2021-11-17T10:22:05Z

I have removed the mentioned code smell.

BTW: I'm surprised that writing to a CuArray in the kernel does not generate an overhead.

Yeah, strange. Wonder why that is, I have run the benchmark in multiple independent sessions just to be sure. The results stay the same.

Can Ozmaden and others added 30 commits April 12, 2020 23:33

initial commit in the gpu branch

64b0064

GPU function blueprints for downcovert & car. rep.

5aad150

Gpu carrier generation draft

4d602f9

downconvert loop blueprint

f1952f5

Implement carrier replica gpu function

808b7d6

Implement code replica and corr on GPU

d905623

Delete unnecessary comments

9cbca18

Reflect the development in readme

c4e1616

Adjust Gain Controlled Signal for a GPU Signal

a9869a3

Fix constructor, make use of vector operations

43c07fd

Merge pull request #1 from JuliaGNSS/master

186acde

Update forked master due to hotfix

Remove .vscode garbage, adjust global gitignore

0c0b49e

Change Float16 occurances to Float32 due to perf.

4217d83

Change GPU carrier_replica to StructArray, adjust the calculation method

ad2b607

Merge pull request #2 from ozmaden/master

82bcf0f

Incorporate the hotfix from the master repo

Complete the GPU downconvert function

ecc981e

Add CUDA package dependency

c8fc7a4

GPU correlation anc code replica blueprint

80e11e4

Merge branch 'feature/gpu-accelerated-correlation' of https://github.…

b8ae78b

…com/ozmaden/Tracking.jl into feature/gpu-accelerated-correlation

Fix syntax

4cb1e05

Correlation using dot product on GPU

e263962

Fix GPU correlate parameter types

269e956

Add new dependencies

9464364

Fix algorithm, use @views macro for performance

2927448

Optimize carrier generation by taking fewer steps

ea93620

Union types for the carrier and code

7cdacfa

Gain control for the gpu signal implemented

a5b8c31

GPU code replica, GNSSSignals 28d32c4324e40a0b93391b06820deea98112a02d

629b96b

Merge pull request #3 from JuliaGNSS/master

16e69ed

Compat LoopVectorization

Merge pull request #4 from ozmaden/master

eae463d

Compat LoopVectorization

coezmaden added 14 commits November 7, 2021 08:09

Allow scalar indexing for cn0_estimation test

c7b31b6

Allowscalar deprecation

2bd5482

Solve scalar indexing in accumaltor results

0ce8e02

Fix GPU multi antenna tracking state

363159e

Seperate functions for matrix and vector cases

9d278e0

Allowscalar for tracking loop tests

b4d66c1

Remove CUDA broadcasting functions, clean comments

10d7960

Update readme with a CUDA.jl example

63d21d2

Merge ozmaden/Tracking#14

465b422

Merge branch 'master' of git://github.com/JuliaGNSS/Tracking.jl into …

2c235dc

…JuliaGNSS-master

Merge branch 'JuliaGNSS-master' into feature/gpu_kernels

504afbb

Adjust GPU functions according to the change JuliaGNSS#31

e5a8398

Make CUDA test names consistent

eb51208

Add multiple antenna GPU test

e45cd77

coezmaden added 2 commits November 14, 2021 10:57

Fix examples according to JuliaGNSS#31

8f3464d

Check for signal and codes type consistency

90358a9

Add Julia BuildKite CI for CUDA tests

81d94c3

zsoerenm reviewed Nov 16, 2021

View reviewed changes

coezmaden added 2 commits November 17, 2021 07:10

Remove leftovers

ccb75f0

Remove unnecessary structs

2a24f7d

zsoerenm reviewed Nov 17, 2021

View reviewed changes

Remove the unnecessary carrier vector

eab881c

zsoerenm reviewed Nov 17, 2021

View reviewed changes

Remove unused functions and duplicates

724dcf6

zsoerenm merged commit fc50604 into JuliaGNSS:master Nov 17, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bring CUDA support to `Tracking.jl` #33

Bring CUDA support to `Tracking.jl` #33

coezmaden commented Nov 14, 2021

codecov-commenter commented Nov 14, 2021 •

edited

Loading

coezmaden commented Nov 14, 2021 •

edited

Loading

zsoerenm left a comment

zsoerenm Nov 16, 2021

zsoerenm Nov 16, 2021

zsoerenm Nov 16, 2021

zsoerenm Nov 16, 2021

zsoerenm Nov 16, 2021

zsoerenm Nov 16, 2021

coezmaden commented Nov 17, 2021

zsoerenm Nov 17, 2021

zsoerenm Nov 17, 2021

zsoerenm Nov 17, 2021

zsoerenm Nov 17, 2021

coezmaden commented Nov 17, 2021

zsoerenm left a comment •

edited

Loading

zsoerenm Nov 17, 2021

zsoerenm Nov 17, 2021

coezmaden commented Nov 17, 2021

Bring CUDA support to Tracking.jl #33

Bring CUDA support to Tracking.jl #33

Conversation

coezmaden commented Nov 14, 2021

codecov-commenter commented Nov 14, 2021 • edited Loading

Codecov Report

coezmaden commented Nov 14, 2021 • edited Loading

zsoerenm left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coezmaden commented Nov 17, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coezmaden commented Nov 17, 2021

zsoerenm left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coezmaden commented Nov 17, 2021

Bring CUDA support to `Tracking.jl` #33

Bring CUDA support to `Tracking.jl` #33

codecov-commenter commented Nov 14, 2021 •

edited

Loading

coezmaden commented Nov 14, 2021 •

edited

Loading

zsoerenm left a comment •

edited

Loading